HW2: Wine Features

Author

Jacob Plax

Published

February 3, 2025

Abstract:

This is a technical blog post, published as both an HTML file and a .qmd file, hosted on GitHub Pages.

Setup

Setup Code:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
Warning: package 'caret' was built under R version 4.3.3
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
library(fastDummies)
Warning: package 'fastDummies' was built under R version 4.3.3
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/wine.rds")))

Feature Engineering

We begin by engineering a number of features.

  1. Create a total of 10 features (including points).
  2. Remove all rows with a missing value.
  3. Ensure that log(price) and the engineered features are the only columns that remain in the wino dataframe.
wino <- wine %>% 
  mutate(lprice=log(price)) %>%
  mutate(country = fct_lump(country, 4)) %>%    
  mutate(variety = fct_lump(variety, 4)) %>%                
  select(lprice, points, country, variety) %>%
  drop_na()
head(wino)
# A tibble: 6 × 4
  lprice points country variety   
   <dbl>  <dbl> <fct>   <fct>     
1   2.71     87 Other   Other     
2   2.64     87 US      Other     
3   2.56     87 US      Other     
4   4.17     87 US      Pinot Noir
5   2.71     87 Spain   Other     
6   2.77     87 Italy   Other     

Explanation:

  1. We create a new column lprice which is the logarithm of the price column.
  2. We keep the 4 most common countries in the country column and lump the rest into “Other”.
  3. We keep the 4 most common varieties in the variety column and lump the rest into “Other”.
  4. We select only the lprice, points, country, and variety columns.
  5. We remove any rows that contain missing values.
  6. Finally, we display the first few rows of the resulting wino dataframe using the head function.
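
To see the lumping step in isolation, here is a minimal sketch on a made-up factor. The `toy` values are hypothetical and not drawn from the wine data, and the example assumes the forcats package is installed (it is attached as part of tidyverse above).

```r
library(forcats)

# Hypothetical toy data: eight wines from five countries.
toy <- factor(c("US", "US", "US", "France", "France", "Italy", "Spain", "Chile"))

# Keep the 2 most frequent levels and collapse the rest into "Other".
lumped <- fct_lump(toy, n = 2)

levels(lumped)  # the two kept levels, with "Other" placed last
table(lumped)   # counts per level after lumping
```

Lumping keeps the number of dummy variables small when the factor is later passed to a linear model, at the cost of treating all rare countries or varieties as one group.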

Caret

We now use a train/test split to evaluate the features.

  1. Use the caret package to partition the wino dataframe into an 80/20 split.
  2. Run a linear regression with bootstrap resampling.
  3. Report RMSE on the test partition of the data.
set.seed(123)

trainIndex <- createDataPartition(wino$lprice, p = 0.8, list = FALSE)
wino_train <- wino[trainIndex, ]
wino_test <- wino[-trainIndex, ]

train_control <- trainControl(method = "boot", number = 100)
model <- train(lprice ~ ., data = wino_train, method = "lm", trControl = train_control)

predictions <- predict(model, wino_test)
rmse <- sqrt(mean((wino_test$lprice - predictions)^2))
rmse
[1] 0.4902949

Explanation

  1. We set a seed for reproducibility.
  2. We create a training index that partitions the wino dataframe into an 80/20 split.
  3. We create training and testing datasets using the partition index.
  4. We define the training control using bootstrap resampling with 100 iterations.
  5. We train a linear regression model using the training data and the defined training control.
  6. We make predictions on the test data using the trained model.
  7. We calculate the Root Mean Squared Error (RMSE) to evaluate the model’s performance on the test data.
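
The bootstrap resampling in step 4 can be sketched by hand in base R. This is only an illustration of the idea, not caret's actual implementation: it assumes that each resample is fit on rows drawn with replacement and scored on the rows that were never drawn (the "out-of-bag" rows). The built-in `mtcars` data, the `rmse` helper, and `n_boot` are stand-ins chosen for this example.

```r
set.seed(123)

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Illustrative stand-in data: mtcars ships with base R.
dat <- mtcars[, c("mpg", "wt", "hp")]
n_boot <- 25  # fewer than the 100 used above, just to keep the sketch quick

oob_rmse <- replicate(n_boot, {
  # Draw row indices with replacement to form the bootstrap sample.
  idx <- sample(nrow(dat), replace = TRUE)
  fit <- lm(mpg ~ ., data = dat[idx, ])
  # Rows never drawn are "out-of-bag" and act as a holdout set.
  oob <- dat[-unique(idx), ]
  rmse(oob$mpg, predict(fit, oob))
})

mean(oob_rmse)  # average out-of-bag RMSE across resamples
```

The resampled RMSE is estimated from the training rows only, whereas the value reported above is computed on the held-out test partition. Because the target is log(price), an RMSE of about 0.49 corresponds to predictions that are typically off by a factor of roughly exp(0.49), or about 1.6, in price.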

Variable selection

We now graph the importance of the features in the fitted model.

plot(varImp(model, scale = FALSE))

Explanation

  1. We use the varImp function from the caret package to calculate the importance of each feature in the model; scale = FALSE keeps the raw importance scores rather than rescaling them to a 0-100 range.
  2. We plot the variable importance using the plot function, which helps us visualize the significance of each feature in predicting the target variable lprice.
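
For a linear model, the importance score that caret reports is, to my understanding, the absolute value of each coefficient's t-statistic. The sketch below reproduces that calculation by hand in base R on the built-in `mtcars` data, which is a stand-in and not the wine model; the choice of predictors is arbitrary.

```r
# Illustrative stand-in data: mtcars ships with base R.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: estimate, standard error, t value, p value.
coefs <- summary(fit)$coefficients

# Drop the intercept and take the absolute t-statistic of each term.
importance <- sort(abs(coefs[-1, "t value"]), decreasing = TRUE)
importance
```

Under this measure, a factor column such as country contributes one coefficient per non-reference level, so it shows up as several separate bars in the plot rather than one.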